2022-05-06
In research, missing data occur when a data value is unavailable. Many empirical studies encounter missing data. Missing data can occur in many stages of research due to many different causes in many different forms.
Each type of missing data may have different causes and different implications for the methods used to deal with it.
The underlying causes of missing data are known as missing data mechanisms and were first described by Rubin (1976).
Rubin distinguished three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
Missing data are MCAR when the probability of missing data on a variable is unrelated to any other measured variable and is unrelated to the variable with missing values itself.
In other words, the missingness on the variable is completely unsystematic.
Below is a description of the complete data example. We will use this example to show the implications of each missing data mechanism.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 1000 | 0.00 | 2.21 | 0.07 | 0.02 | 2.25 | -7.42 | 6.69 | 14.11 | -0.13 | -0.13 | 0.07 |
| X2 | 2 | 1000 | 0.09 | 2.26 | 0.14 | 0.12 | 2.24 | -6.73 | 6.58 | 13.31 | -0.12 | -0.15 | 0.07 |
| X3 | 3 | 1000 | -0.03 | 2.25 | 0.02 | -0.04 | 2.27 | -7.27 | 6.54 | 13.81 | 0.00 | -0.22 | 0.07 |
When we create MCAR data for 50% of the subjects in variable X1, we see that the statistics for variable X1 have not changed much. Complete data:

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 1000 | 0 | 2.21 | 0.07 | 0.02 | 2.25 | -7.42 | 6.69 | 14.11 | -0.13 | -0.13 | 0.07 |

X1 with MCAR missing data:

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 490 | 0.01 | 2.1 | 0.02 | 0.04 | 2.14 | -7.38 | 5.58 | 12.96 | -0.18 | -0.1 | 0.09 |
We can create a missing data indicator variable R1 to explore differences between the subjects with missing data and the subjects without missing data.
mcar <- mcar %>% mutate(R1 = is.na(X1))
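The text does not show how the example data or the MCAR missingness were generated. The following is a minimal sketch under stated assumptions, not the original simulation: the variance structure, the seed, and the names `complete` and `mcar_sim` are all hypothetical, and base R is used so the block runs without extra packages.

```r
# Hypothetical reconstruction: the original simulation code is not shown.
set.seed(123)                      # arbitrary seed, for reproducibility only
n <- 1000

# Three correlated, roughly normal variables (assumed structure):
# a shared component plus independent noise.
shared <- rnorm(n, sd = 1.5)
complete <- data.frame(
  X1 = shared + rnorm(n, sd = 1.5),
  X2 = shared + rnorm(n, sd = 1.5),
  X3 = shared + rnorm(n, sd = 1.5)
)

# MCAR: every subject has the same 50% chance of a missing X1,
# independent of X1, X2 and X3.
mcar_sim <- complete
mcar_sim$X1[runif(n) < 0.5] <- NA

# Missing data indicator, as in the text (TRUE = X1 is missing).
mcar_sim$R1 <- is.na(mcar_sim$X1)

# Under MCAR the observed X1 values are a random subsample, so their
# mean and sd should stay close to the complete-data values.
c(complete = mean(complete$X1), observed = mean(mcar_sim$X1, na.rm = TRUE))
c(complete = sd(complete$X1),   observed = sd(mcar_sim$X1, na.rm = TRUE))
```

Because the missingness is drawn independently of all variables, any difference between the two rows of output is sampling noise only.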
Missing data are MAR when the probability of missing data on a variable is related to some other measured variable in the model, but not to the value of the variable with missing values itself.
For example, older people more often have missing values for IQ. In that case the probability of missing data on IQ is related to age.
When we create MAR data for 50% of the subjects in variable X1, we see that the statistics for variable X1 have changed. Complete data:

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 1000 | 0 | 2.21 | 0.07 | 0.02 | 2.25 | -7.42 | 6.69 | 14.11 | -0.13 | -0.13 | 0.07 |

X1 with MAR missing data:

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 512 | -0.28 | 2.25 | -0.27 | -0.26 | 2.3 | -7.42 | 5.34 | 12.76 | -0.13 | -0.25 | 0.1 |
We can create a missing data indicator variable R1 to explore differences between the subjects with missing data and the subjects without missing data.
mar <- mar %>% mutate(R1 = is.na(X1))
The difference between the group with missing values (TRUE) and the group without missing values (FALSE) shows that having missing data is related to the scores on the other variables.
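As an illustration of how such a pattern can arise, here is a minimal sketch that generates MAR missingness and compares the two groups. It is not the original simulation: the data structure, the logistic coefficients, and the names `complete` and `mar_sim` are assumptions chosen for the example.

```r
# Hypothetical simulation; the original data-generating code is not shown.
set.seed(123)
n <- 1000
shared <- rnorm(n, sd = 1.5)
complete <- data.frame(
  X1 = shared + rnorm(n, sd = 1.5),
  X2 = shared + rnorm(n, sd = 1.5),
  X3 = shared + rnorm(n, sd = 1.5)
)

# MAR: the probability that X1 is missing rises with the fully observed
# X2 and X3. plogis() is the logistic function; the 0.5 weights are arbitrary.
p_miss <- plogis(0.5 * complete$X2 + 0.5 * complete$X3)
mar_sim <- complete
mar_sim$X1[runif(n) < p_miss] <- NA
mar_sim$R1 <- is.na(mar_sim$X1)

# Mean of the other variables for subjects with (TRUE) and
# without (FALSE) a missing X1.
aggregate(cbind(X2, X3) ~ R1, data = mar_sim, FUN = mean)
```

In this sketch the group with missing X1 has higher means on X2 and X3 by construction, which is the kind of group difference the missing data indicator is meant to reveal.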
Data are MNAR when the missing values on a variable are related to the values of that variable itself, even after controlling for other variables.
For example, weight data may be missing mostly for heavier persons.
When we create MNAR data for 50% of the subjects in variable X1, we see that the statistics for variable X1 have changed. Complete data:

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 1000 | 0 | 2.21 | 0.07 | 0.02 | 2.25 | -7.42 | 6.69 | 14.11 | -0.13 | -0.13 | 0.07 |

X1 with MNAR missing data:

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 488 | -1.12 | 1.96 | -1.17 | -1.18 | 1.53 | -7.42 | 5.11 | 12.53 | 0.26 | 0.73 | 0.09 |
We can create a missing data indicator variable R1 to explore differences between the subjects with missing data and the subjects without missing data.
mnar <- mnar %>% mutate(R1 = is.na(X1))
The difference between the group with missing values (TRUE) and the group without missing values (FALSE) shows that having missing data is related to the scores on the other variables.
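A simulation makes the MNAR case concrete, because only in simulated data can we look at the values that went missing. The sketch below is an assumption-laden illustration, not the original code: the data structure, the 0.8 coefficient, and the names `complete` and `mnar_sim` are hypothetical.

```r
# Hypothetical simulation; the original data-generating code is not shown.
set.seed(123)
n <- 1000
shared <- rnorm(n, sd = 1.5)
complete <- data.frame(
  X1 = shared + rnorm(n, sd = 1.5),
  X2 = shared + rnorm(n, sd = 1.5),
  X3 = shared + rnorm(n, sd = 1.5)
)

# MNAR: the probability that X1 is missing rises with X1 itself
# (the 0.8 weight is an arbitrary choice).
p_miss <- plogis(0.8 * complete$X1)
mnar_sim <- complete
mnar_sim$X1[runif(n) < p_miss] <- NA
mnar_sim$R1 <- is.na(mnar_sim$X1)

# Only possible in a simulation: the true X1 values per group.
tapply(complete$X1, mnar_sim$R1, mean)

# The observed X1 mean is shifted downward relative to the complete data.
mean(mnar_sim$X1, na.rm = TRUE) - mean(complete$X1)

# X2 also differs between the groups, but only through its
# correlation with X1, not because X2 drives the missingness.
tapply(mnar_sim$X2, mnar_sim$R1, mean)
```

The last comparison shows why a group difference on other variables cannot distinguish MAR from MNAR: correlated variables pick up the missingness pattern either way.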
Any information about the research process can provide valuable information that helps to evaluate and make assumptions about the missing data mechanism.
Why are data missing?
The missing data mechanisms are defined by the probability that missing data occur.
Probability is not related to other measured variables
Other measured variables are related to the probability of missing data
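The three mechanisms can be written compactly as statements about this probability. The notation below is an assumption, since the text introduces no symbols: $R$ is the missing data indicator, $Y_{\text{obs}}$ the observed part of the data, and $Y_{\text{mis}}$ the part that is missing.

```latex
\begin{aligned}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ still depends on } Y_{\text{mis}} \text{ after conditioning on } Y_{\text{obs}}
\end{aligned}
```

Read this way, MCAR is a special case of MAR, and MNAR is what remains when neither equality holds.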
The essence of testing for MCAR is to compare the group with missing data to the group without missing data.
Univariate testing
Multivariate testing
Independent samples t-test to compare the mean of continuous variables between the group with missing data and the group without missing data.
Note that the classical t-test assumes normally distributed data and homogeneity of variance. R's `t.test()` applies the Welch correction by default, which relaxes the equal-variance assumption.
t.test(X2 ~ R1, data = mcar)
##
##  Welch Two Sample t-test
##
## data:  X2 by R1
## t = -0.11866, df = 996.42, p-value = 0.9056
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -0.2970947  0.2632135
## sample estimates:
## mean in group FALSE  mean in group TRUE
##          0.07756972          0.09451031
t.test(X2 ~ R1, data = mar)
##
##  Welch Two Sample t-test
##
## data:  X2 by R1
## t = -13.814, df = 994.06, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -2.064496 -1.550917
## sample estimates:
## mean in group FALSE  mean in group TRUE
##          -0.7959512           1.0117550
t.test(X2 ~ R1, data = mnar)
##
##  Welch Two Sample t-test
##
## data:  X2 by R1
## t = -3.9996, df = 973.87, p-value = 6.824e-05
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -0.8467042 -0.2893187
## sample estimates:
## mean in group FALSE  mean in group TRUE
##          -0.2046124           0.3633990
Univariate method.
When there are no significant differences we may assume the data are MCAR. Otherwise, we assume not-MCAR (i.e. MAR or MNAR).
Note that we can never truly rule out MNAR.
Chi-square test to compare categorical variables between the group with missing data and the group without missing data.
The test compares the distribution over the categories between the two groups.
Note that the chi-square test assumes that the expected cell frequencies are not too small.
mcar <- mcar %>% mutate(X3c = ifelse(X3 > 0, 1, 0))
chisq.test(mcar$R1, mcar$X3c)
##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  mcar$R1 and mcar$X3c
## X-squared = 0.062121, df = 1, p-value = 0.8032
mar <- mar %>% mutate(X3c = ifelse(X3 > 0, 1, 0))
chisq.test(mar$R1, mar$X3c)
##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  mar$R1 and mar$X3c
## X-squared = 80.787, df = 1, p-value < 2.2e-16
mnar <- mnar %>% mutate(X3c = ifelse(X3 > 0, 1, 0))
chisq.test(mnar$R1, mnar$X3c)
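The assumption about expected cell frequencies can be checked directly, because the object returned by `chisq.test()` stores them in its `expected` component. The sketch below uses simulated stand-in data (the vectors `R1`, `X3` and `X3c` are generated here and are not the tutorial's data), and the threshold of 5 is a common rule of thumb rather than a fixed requirement.

```r
# Simulated stand-in data; not the tutorial's mcar/mar/mnar objects.
set.seed(123)
n <- 1000
X3  <- rnorm(n, sd = 2)
R1  <- runif(n) < 0.5              # MCAR-style missing data indicator
X3c <- ifelse(X3 > 0, 1, 0)        # dichotomised as in the text

result <- chisq.test(R1, X3c)

# Expected counts under independence, one per cell of the 2 x 2 table.
result$expected

# Rule of thumb (an assumption, not a hard rule): all expected counts >= 5.
all(result$expected >= 5)
```

When some expected counts are small, Fisher's exact test (`fisher.test()` in base R) is a common alternative for 2 x 2 tables.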
The probability of missing data can also be investigated in a logistic regression analysis.
The missing data indicator is the dependent variable, and the other variables that may be related to the probability of missing data are the independent variables.
The results of the regression analysis show whether the independent variables relate to the probability of missing data.
Note that when the other variables have missing values as well, a complete-case analysis is used by default.
Note also that `glm()` uses `family = gaussian` by default. The calls below do not set `family = binomial`, so they fit a linear probability model rather than a logistic regression, which is why the output reports t values.
glm(R1 ~ X2 + X3, data = mcar) %>% summary %>% coefficients %>% round(.,3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.510      0.016  32.171    0.000
## X2             0.002      0.007   0.277    0.782
## X3            -0.006      0.007  -0.870    0.384
glm(R1 ~ X2 + X3, data = mar) %>% summary %>% coefficients %>% round(.,3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.483      0.014  34.573        0
## X2             0.078      0.006  12.460        0
## X3             0.056      0.006   8.909        0
glm(R1 ~ X2 + X3, data = mnar) %>% summary %>% coefficients %>% round(.,3)
In the MCAR example, neither X2 nor X3 is related to the probability of missing data in X1, so we may assume that the missing data in X1 are MCAR.
However, in the MAR example, both variables are related to the probability of missing data in X1, so in that case we can assume that the data are not-MCAR.
We cannot rule out MNAR in this situation, since we cannot test the missing values themselves.
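For a logistic regression in the strict sense, `glm()` needs `family = binomial`. The sketch below shows this on simulated stand-in data. The variables are generated inside the block, the coefficients are arbitrary, and `fit` is a hypothetical name, so the numbers it produces are not comparable to the output shown above.

```r
# Simulated stand-in data with MAR-style missingness; not the tutorial's data.
set.seed(123)
n <- 1000
X2 <- rnorm(n, sd = 2)
X3 <- rnorm(n, sd = 2)
R1 <- runif(n) < plogis(0.5 * X2 + 0.5 * X3)   # TRUE = value is missing

# family = binomial gives a logistic regression of the missing data
# indicator on the other variables (default logit link).
fit <- glm(R1 ~ X2 + X3, family = binomial)

# Coefficients are on the log-odds scale; the test statistic is a z value.
round(coef(summary(fit)), 3)

# Odds ratios: multiplicative change in the odds of missingness
# per one-unit increase in each predictor.
exp(coef(fit))
```

With a binary outcome close to a 50/50 split, the linear and logistic fits usually agree on which predictors matter, but only the logistic model keeps predicted probabilities between 0 and 1.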
Little's MCAR test is a multivariate test that evaluates the subgroups of the data that share the same missing data pattern.
For each subgroup (with the same missing data pattern), the observed means are compared with the means estimated by the expectation-maximization (EM) algorithm.
The test statistic is compared with a chi-square distribution to test the null hypothesis that the data are MCAR.
A significant result shows that the data are not-MCAR.
misty::na.test(mcar %>% select(X1:X3))
##  Little's MCAR Test
##
##     n nIncomp nPattern  chi2 df  pval
##  1000     510        2  0.77  2 0.679
misty::na.test(mar %>% select(X1:X3))
##  Little's MCAR Test
##
##     n nIncomp nPattern   chi2 df  pval
##  1000     488        2 222.52  2 0.000
misty::na.test(mnar %>% select(X1:X3))
The test gives no specific information about which variables are related to the probability of missing data.
The test assumes multivariate normality and can only be applied to continuous variables.
The MNAR mechanism can never be ruled out, regardless of the result of the test.
The methods to deal with missing data implicitly assume a missing data mechanism.
MCAR: the strictest assumption. In practice, MCAR data are also the easiest to deal with.
MAR: a less strict assumption. Most advanced missing data methods assume this mechanism (e.g. multiple imputation, FIML).
MNAR: the least strict assumption.